NSF PAR Search | NSF Public Access Repository

tMHG-Finder: Tree-Guided Maximal Homologous Group Finder for Bacterial Genomes

https://doi.org/10.1007/978-3-031-94928-9_6

Yin, Yongze; Kille, Bryce; Ogilvie, Huw A; Treangen, Todd J; Nakhleh, Luay (September 2025, Springer Nature Switzerland)

Free, publicly-accessible full text available September 1, 2026
Graph-based self-supervised learning for repeat detection in metagenomic assembly

https://doi.org/10.1101/gr.279136.124

Azizpour, Ali; Balaji, Advait; Treangen, Todd J; Segarra, Santiago (July 2024, Genome research)

Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, where genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudo-labels for a small proportion of the nodes. We then use those pseudo-labels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic datasets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, our experiments with synthetic metagenomic datasets reveal that incorporating the graph structure and the GNN enhances our detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
Full Text Available
Rapid whole genome characterization of antimicrobial-resistant pathogens using long-read sequencing to identify potential healthcare transmission

https://doi.org/10.1017/ice.2024.202

Wu, Chin-Ting; Shropshire, William C; Bhatti, Micah M; Cantu, Sherry; Glover, Israel K; Anand, Selvalakshmi Selvaraj; Liu, Xiaojun; Kalia, Awdhesh; Treangen, Todd J; Chemaly, Roy F; et al (February 2025, Infection Control & Hospital Epidemiology)

Abstract Objective: Whole genome sequencing (WGS) can help identify transmission of pathogens causing healthcare-associated infections (HAIs). However, the current gold standard of short-read, Illumina-based WGS is labor and time intensive. Given recent improvements in long-read Oxford Nanopore Technologies (ONT) sequencing, we sought to establish a low resource approach providing accurate WGS-pathogen comparison within a time frame allowing for infection prevention and control (IPC) interventions. Methods: WGS was prospectively performed on pathogens at increased risk of potential healthcare transmission using the ONT MinION sequencer with R10.4.1 flow cells and Dorado basecaller. Potential transmission was assessed via Ridom SeqSphere+ for core genome multilocus sequence typing and MINTyper for reference-based core genome single nucleotide polymorphisms using previously published cutoff values. The accuracy of our ONT pipeline was determined relative to Illumina. Results: Over a six-month period, 242 bacterial isolates from 216 patients were sequenced by a single operator. Compared to the Illumina gold standard, our ONT pipeline achieved a mean identity score of Q60 for assembled genomes, even with a coverage rate as low as 40×. The mean time from initiating DNA extraction to complete analysis was 2 days (IQR 2–3.25 days). We identified five potential transmission clusters comprising 21 isolates (8.7% of sequenced strains). Integrating ONT with epidemiological data, >70% (15/21) of putative transmission cluster isolates originated from patients with potential healthcare transmission links. Conclusions: Via a stand-alone ONT pipeline, we detected potentially transmitted HAI pathogens rapidly and accurately, aligning closely with epidemiological data. Our low-resource method has the potential to assist in IPC efforts.
Free, publicly-accessible full text available February 1, 2026
Parsnp 2.0: scalable core-genome alignment for massive microbial datasets

https://doi.org/10.1093/bioinformatics/btae311

Kille, Bryce; Nute, Michael G; Huang, Victor; Kim, Eddie; Phillippy, Adam M; Treangen, Todd J (May 2024, Bioinformatics)
Schwartz, Russell (Ed.)
Abstract Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability and implementation: Parsnp v2 is available at https://github.com/marbl/parsnp.
Full Text Available
Unveiling microbial diversity: harnessing long-read sequencing technology

https://doi.org/10.1038/s41592-024-02262-1

Agustinho, Daniel P.; Fu, Yilei; Menon, Vipin K.; Metcalf, Ginger A.; Treangen, Todd J.; Sedlazeck, Fritz J. (April 2024, Nature Methods)

Full Text Available
KombOver: Efficient k-core and K-truss based characterization of perturbations within the human gut microbiome

https://doi.org/10.1142/9789811286421_0039

Sapoval, Nicolae; Tanevski, Marko; Treangen, Todd J. (December 2023, Pacific Symposium on Biocomputing 2024)

The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb
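
As a rough illustration of the k-core idea mentioned in this abstract, the sketch below computes core numbers on a small undirected graph by repeatedly peeling the lowest-degree node. It is a generic textbook procedure written for clarity, not KombOver's implementation; the function name `core_numbers` and the toy graph are assumptions made for this example, and K-truss analysis is not covered.

```python
def core_numbers(adj):
    """Return the core number of every node in an undirected graph.

    adj is a dict mapping each node to the set of its neighbours
    (no self loops). A node has core number k if it belongs to the
    k-core (a maximal subgraph with all degrees >= k) but not to the
    (k+1)-core. This is a simple quadratic-time peeling sketch.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        # The core number never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
toy_cores = core_numbers(toy_graph)
```

In the toy graph the triangle nodes end up in the 2-core while the pendant node only reaches the 1-core, which is the kind of dense-versus-peripheral distinction a k-core analysis exposes.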
Full Text Available
Microbial Community Profiling Protocol with Full‐length 16S rRNA Sequences and Emu

https://doi.org/10.1002/cpz1.978

Curry, Kristen D.; Soriano, Sirena; Nute, Michael G.; Villapol, Sonia; Dilthey, Alexander; Treangen, Todd J. (March 2024, Current Protocols)

Abstract 16S rRNA targeted amplicon sequencing is an established standard for elucidating microbial community composition. While high‐throughput short‐read sequencing can elicit only a portion of the 16S rRNA gene due to its limited read length, third generation sequencing can read the 16S rRNA gene in its entirety and thus provide more precise taxonomic classification. Here, we present a protocol for generating full‐length 16S rRNA sequences with Oxford Nanopore Technologies (ONT) and a microbial community profile with Emu. We select Emu for analyzing ONT sequences as it leverages information from the entire community to overcome errors due to incomplete reference databases and hardware limitations to ultimately obtain species‐level resolution. This pipeline provides a low‐cost solution for characterizing microbiome composition by exploiting real‐time, long‐read ONT sequencing and tailored software for accurate characterization of microbial communities. © 2024 Wiley Periodicals LLC. Basic Protocol: Microbial community profiling with Emu. Support Protocol 1: Full‐length 16S rRNA microbial sequences with Oxford Nanopore Technologies sequencing platform. Support Protocol 2: Building a custom reference database for Emu.
Full Text Available
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

https://doi.org/10.1093/bioinformatics/btad512

Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M (September 2023, Bioinformatics)
Robinson, Peter (Ed.)
Abstract Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.
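
To make the quantities in this abstract concrete, the sketch below computes the exact Jaccard similarity of two k-mer sets and a naive window-based sampling that keeps the `s` smallest k-mer hashes in every window of `w` consecutive k-mers. With `s = 1` this behaves like minimizer sampling, and with `s > 1` it illustrates the "multiple sampled k-mers per window" idea. It is an assumption-laden toy, not MashMap's rolling implementation: the function names, the BLAKE2b-based hash, and the parameter names `k`, `w`, and `s` are all chosen for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Exact Jaccard similarity between the k-mer sets of a and b."""
    set_a, set_b = kmers(a, k), kmers(b, k)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is arbitrary)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def window_sample(seq, k, w, s):
    """Keep the s smallest distinct hashes in each window of w k-mers.

    Recomputes every window from scratch, so it is far slower than a
    rolling scheme. Returns an empty set if seq has fewer than w k-mers.
    """
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    sampled = set()
    for start in range(len(hashes) - w + 1):
        window = sorted(set(hashes[start:start + w]))
        sampled.update(window[:s])
    return sampled


example_similarity = jaccard("ACGT", "ACGA", 2)  # shares AC and CG out of 4 k-mers
```

Comparing the sampled sets of two sequences instead of their full k-mer sets is what lets sketch-based tools avoid base-level alignment; how the sampling is done determines whether the resulting Jaccard estimate is biased, which is the issue the abstract describes.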
Full Text Available
Bakdrive: identifying a minimum set of bacterial species driving interactions across multiple microbial communities

https://doi.org/10.1093/bioinformatics/btad236

Wang, Qi; Nute, Michael; Treangen, Todd J. (June 2023, Bioinformatics)

Abstract Motivation: Interactions among microbes within microbial communities have been shown to play crucial roles in human health. In spite of recent progress, low-level knowledge of the bacteria driving microbial interactions within microbiomes is still lacking, limiting our ability to fully decipher and control microbial communities. Results: We present a novel approach for identifying species driving interactions within microbiomes. Bakdrive infers ecological networks of given metagenomic sequencing samples and identifies minimum sets of driver species (MDS) using control theory. Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. In extensive simulated data, we demonstrate that by identifying driver species from healthy donor samples and introducing them to the disease samples, we can restore the gut microbiome in recurrent Clostridioides difficile (rCDI) infection patients to a healthy state. We also applied Bakdrive to two real datasets, rCDI and Crohn's disease patients, uncovering driver species consistent with previous work. Bakdrive represents a novel approach for capturing microbial interactions. Availability and implementation: Bakdrive is open-source and available at: https://gitlab.com/treangenlab/bakdrive.
Enabling accurate and early detection of recently emerged SARS-CoV-2 variants of concern in wastewater

https://doi.org/10.1038/s41467-023-38184-3

Sapoval, Nicolae; Liu, Yunxi; Lou, Esther G.; Hopkins, Loren; Ensor, Katherine B.; Schneider, Rebecca; Stadler, Lauren B.; Treangen, Todd J. (December 2023, Nature Communications)

Abstract As clinical testing declines, wastewater monitoring can provide crucial surveillance on the emergence of SARS-CoV-2 variants of concern (VoCs) in communities. In this paper we present QuaID, a novel bioinformatics tool for VoC detection based on quasi-unique mutations. The benefits of QuaID are three-fold: (i) it provides up to 3-week earlier VoC detection, (ii) it detects VoCs accurately (>95% precision on simulated benchmarks), and (iii) it leverages all mutational signatures (including insertions & deletions).
Full Text Available
